@Jefffrey
Contributor

Which issue does this PR close?

Rationale for this change

Make it easier to write distinct variations of aggregate functions by refactoring some of the common code together; specifically, how they maintain the complete set of distinct primitive values, as this code was duplicated across different functions.

What changes are included in this PR?

Introduce new GenericDistinctBuffer which has methods similar to Accumulator to manage an internal HashSet of values, so implementations like percentile_cont and sum can use it internally and only implement their own evaluate functions.
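
As a rough illustration of the idea, here is a minimal std-only sketch of a buffer that owns a HashSet and exposes Accumulator-like methods. It is not the code from this PR: `DistinctBuffer`, `HashableF64`, and the slice-based signatures are hypothetical stand-ins for the Arrow-based types, and floats are hashed by bit pattern as an assumption about what a hashing wrapper has to do.

```rust
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical wrapper: `f64` is neither `Hash` nor `Eq`, so this sketch
/// compares and hashes floats by their bit pattern.
#[derive(Clone, Copy, Debug)]
struct HashableF64(f64);

impl PartialEq for HashableF64 {
    fn eq(&self, other: &Self) -> bool {
        self.0.to_bits() == other.0.to_bits()
    }
}

impl Eq for HashableF64 {}

impl Hash for HashableF64 {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.0.to_bits().hash(state);
    }
}

/// Hypothetical, simplified distinct buffer: it only tracks which values
/// have been seen and leaves the final computation to its owner.
#[derive(Default)]
struct DistinctBuffer {
    values: HashSet<HashableF64>,
}

impl DistinctBuffer {
    /// Analogue of `update_batch`: fold one batch of nullable inputs in,
    /// skipping nulls.
    fn update_batch(&mut self, batch: &[Option<f64>]) {
        self.values
            .extend(batch.iter().flatten().copied().map(HashableF64));
    }

    /// Analogue of `state`: export the distinct values seen so far.
    fn state(&self) -> Vec<f64> {
        self.values.iter().map(|v| v.0).collect()
    }

    /// Analogue of `merge_batch`: absorb state exported by another buffer.
    fn merge_batch(&mut self, other_state: &[f64]) {
        self.values
            .extend(other_state.iter().copied().map(HashableF64));
    }
}
```

Two buffers fed from different partitions can be combined by passing one buffer's `state()` to the other's `merge_batch`; the merged set contains each value once.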

Are these changes tested?

Existing tests.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the functions Changes to functions implementation label Oct 29, 2025
Contributor Author

It would be nice if I could pull PrimitiveDistinctCountAccumulator into the deduplication as well. However, it is specialized for types which don't need to hash through Hashable (aka non-float types), and I think there might be a performance hit if I try to force them to use Hashable 🤔

Contributor

Yeah, we definitely don't want to be hashing if we can avoid that

/// `merge_batch` and a `Vec` of `ArrayRef` that are converted to scalar values
/// in the final evaluation step so that we avoid expensive conversions and
/// allocations during `update_batch`.
pub struct GenericDistinctBuffer<T: ArrowPrimitiveType> {
Contributor Author

Main implementation here. I toyed with the idea of making this implement Accumulator and having the different functions (like median and percentile_cont) provide their evaluate logic as a closure, but it got a bit messy. For now they delegate their state/update_batch/merge_batch to this inner struct, which lets them grab the final set of distinct values and do their own evaluate.
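
The delegation pattern described above can be sketched roughly as follows. This is a std-only illustration, not the PR's code: `DistinctBuffer`, `DistinctSum`, and `DistinctMedian` are hypothetical names, the values are plain `i64` slices rather than Arrow arrays, and the median logic is a simple assumption chosen for the example.

```rust
use std::collections::HashSet;

/// Hypothetical shared buffer: its only job is tracking distinct values.
#[derive(Default)]
struct DistinctBuffer {
    values: HashSet<i64>,
}

impl DistinctBuffer {
    fn update_batch(&mut self, batch: &[i64]) {
        self.values.extend(batch.iter().copied());
    }

    fn merge_batch(&mut self, state: &[i64]) {
        self.values.extend(state.iter().copied());
    }
}

/// Hypothetical distinct-sum accumulator: forwards updates and merges to
/// the buffer and only supplies its own `evaluate`.
#[derive(Default)]
struct DistinctSum {
    buffer: DistinctBuffer,
}

impl DistinctSum {
    fn update_batch(&mut self, batch: &[i64]) {
        self.buffer.update_batch(batch);
    }

    fn merge_batch(&mut self, state: &[i64]) {
        self.buffer.merge_batch(state);
    }

    fn evaluate(&self) -> i64 {
        self.buffer.values.iter().sum()
    }
}

/// Hypothetical distinct-median accumulator: same delegation, different
/// `evaluate`.
#[derive(Default)]
struct DistinctMedian {
    buffer: DistinctBuffer,
}

impl DistinctMedian {
    fn update_batch(&mut self, batch: &[i64]) {
        self.buffer.update_batch(batch);
    }

    fn evaluate(&self) -> Option<f64> {
        let mut sorted: Vec<i64> = self.buffer.values.iter().copied().collect();
        sorted.sort_unstable();
        let n = sorted.len();
        if n == 0 {
            return None;
        }
        let mid = n / 2;
        Some(if n % 2 == 1 {
            sorted[mid] as f64
        } else {
            // Even count: average the two middle values.
            (sorted[mid - 1] as f64 + sorted[mid] as f64) / 2.0
        })
    }
}
```

Composition keeps the set-maintenance code in one place while each aggregate stays free to define its own result type, which is harder to express if the shared struct itself has to implement a single evaluate signature.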

@Jefffrey Jefffrey marked this pull request as ready for review October 29, 2025 09:13
Contributor

@alamb alamb left a comment

Thank you @Jefffrey -- this is really quite elegant. I am sorry it took so long to review

The only thing I think we need to do is ensure this doesn't have any impact on performance (I don't expect that it will, but I want to be sure)

Really nice 🏆

self.values.extend(arr.iter().flatten().map(Hashable));
} else {
self.values
.extend(arr.values().iter().cloned().map(Hashable));
Contributor

nice -- this is an elegant way to special case nulls/non nulls
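
For readers unfamiliar with the pattern, the two branches can be mimicked without Arrow as below. The condition guarding the branches is not visible in the excerpt, so this sketch assumes it checks whether the array contains nulls. `Column` and `extend_distinct` are hypothetical names, with a `Vec<bool>` standing in for a validity bitmap.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified column: a dense value buffer plus an optional
/// validity mask (`None` means the column has no nulls).
struct Column {
    values: Vec<i64>,
    validity: Option<Vec<bool>>,
}

/// Add every non-null value of `col` to `set`.
fn extend_distinct(set: &mut HashSet<i64>, col: &Column) {
    match &col.validity {
        // Nulls may be present: check the mask for every slot.
        Some(mask) => set.extend(
            col.values
                .iter()
                .zip(mask)
                .filter(|(_, valid)| **valid)
                .map(|(v, _)| *v),
        ),
        // No nulls: read straight from the value buffer, no per-slot check.
        None => set.extend(col.values.iter().copied()),
    }
}
```

The point of the split is that the no-null branch avoids a per-element validity check, which matters when most batches contain no nulls.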

@alamb
Contributor

alamb commented Nov 11, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing refactor-agg-distinct (3c389c0) to 6cc73fa diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb
Contributor

alamb commented Nov 11, 2025

🤖: Benchmark completed

Details

Comparing HEAD and refactor-agg-distinct
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ refactor-agg-distinct ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2681.40 ms │            2730.16 ms │ no change │
│ QQuery 1     │  1249.36 ms │            1297.64 ms │ no change │
│ QQuery 2     │  2414.27 ms │            2486.04 ms │ no change │
│ QQuery 3     │  1151.81 ms │            1187.91 ms │ no change │
│ QQuery 4     │  2243.87 ms │            2261.10 ms │ no change │
│ QQuery 5     │ 27784.40 ms │           28019.72 ms │ no change │
│ QQuery 6     │  4183.57 ms │            4200.12 ms │ no change │
│ QQuery 7     │  3719.32 ms │            3654.33 ms │ no change │
└──────────────┴─────────────┴───────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                    │ 45428.00ms │
│ Total Time (refactor-agg-distinct)   │ 45837.03ms │
│ Average Time (HEAD)                  │  5678.50ms │
│ Average Time (refactor-agg-distinct) │  5729.63ms │
│ Queries Faster                       │          0 │
│ Queries Slower                       │          0 │
│ Queries with No Change               │          8 │
│ Queries with Failure                 │          0 │
└──────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ refactor-agg-distinct ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.19 ms │               2.58 ms │  1.18x slower │
│ QQuery 1     │    49.89 ms │              51.05 ms │     no change │
│ QQuery 2     │   139.29 ms │             140.00 ms │     no change │
│ QQuery 3     │   154.15 ms │             164.65 ms │  1.07x slower │
│ QQuery 4     │  1071.27 ms │            1091.98 ms │     no change │
│ QQuery 5     │  1496.79 ms │            1516.96 ms │     no change │
│ QQuery 6     │     2.17 ms │               2.22 ms │     no change │
│ QQuery 7     │    55.06 ms │              56.95 ms │     no change │
│ QQuery 8     │  1417.71 ms │            1486.53 ms │     no change │
│ QQuery 9     │  1807.14 ms │            1820.02 ms │     no change │
│ QQuery 10    │   389.54 ms │             410.87 ms │  1.05x slower │
│ QQuery 11    │   446.25 ms │             456.15 ms │     no change │
│ QQuery 12    │  1386.50 ms │            1429.53 ms │     no change │
│ QQuery 13    │  2162.95 ms │            2173.12 ms │     no change │
│ QQuery 14    │  1285.62 ms │            1307.88 ms │     no change │
│ QQuery 15    │  1220.96 ms │            1255.23 ms │     no change │
│ QQuery 16    │  2689.84 ms │            2738.54 ms │     no change │
│ QQuery 17    │  2657.93 ms │            2714.66 ms │     no change │
│ QQuery 18    │  5395.06 ms │            4982.75 ms │ +1.08x faster │
│ QQuery 19    │   128.10 ms │             129.16 ms │     no change │
│ QQuery 20    │  2065.75 ms │            1967.07 ms │     no change │
│ QQuery 21    │  2324.17 ms │            2328.12 ms │     no change │
│ QQuery 22    │  4167.13 ms │            3931.84 ms │ +1.06x faster │
│ QQuery 23    │ 13031.85 ms │           12929.07 ms │     no change │
│ QQuery 24    │   219.94 ms │             221.62 ms │     no change │
│ QQuery 25    │   523.46 ms │             522.58 ms │     no change │
│ QQuery 26    │   226.94 ms │             220.89 ms │     no change │
│ QQuery 27    │  2869.38 ms │            2858.88 ms │     no change │
│ QQuery 28    │ 22775.69 ms │           24278.99 ms │  1.07x slower │
│ QQuery 29    │   967.16 ms │             969.06 ms │     no change │
│ QQuery 30    │  1337.50 ms │            1336.90 ms │     no change │
│ QQuery 31    │  1380.43 ms │            1343.02 ms │     no change │
│ QQuery 32    │  4624.40 ms │            4955.25 ms │  1.07x slower │
│ QQuery 33    │  5922.16 ms │            5932.37 ms │     no change │
│ QQuery 34    │  5927.20 ms │            5982.00 ms │     no change │
│ QQuery 35    │  2003.94 ms │            2024.77 ms │     no change │
│ QQuery 36    │   121.02 ms │             119.85 ms │     no change │
│ QQuery 37    │    52.78 ms │              52.00 ms │     no change │
│ QQuery 38    │   121.42 ms │             121.59 ms │     no change │
│ QQuery 39    │   196.08 ms │             200.65 ms │     no change │
│ QQuery 40    │    43.86 ms │              42.67 ms │     no change │
│ QQuery 41    │    40.35 ms │              38.89 ms │     no change │
│ QQuery 42    │    33.88 ms │              33.75 ms │     no change │
└──────────────┴─────────────┴───────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                    │ 94934.87ms │
│ Total Time (refactor-agg-distinct)   │ 96342.63ms │
│ Average Time (HEAD)                  │  2207.79ms │
│ Average Time (refactor-agg-distinct) │  2240.53ms │
│ Queries Faster                       │          2 │
│ Queries Slower                       │          5 │
│ Queries with No Change               │         36 │
│ Queries with Failure                 │          0 │
└──────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ refactor-agg-distinct ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 173.41 ms │             170.27 ms │     no change │
│ QQuery 2     │  27.72 ms │              26.96 ms │     no change │
│ QQuery 3     │  37.42 ms │              36.05 ms │     no change │
│ QQuery 4     │  28.96 ms │              28.30 ms │     no change │
│ QQuery 5     │  98.00 ms │              75.98 ms │ +1.29x faster │
│ QQuery 6     │  26.36 ms │              19.60 ms │ +1.34x faster │
│ QQuery 7     │ 259.23 ms │             221.02 ms │ +1.17x faster │
│ QQuery 8     │  32.46 ms │              33.29 ms │     no change │
│ QQuery 9     │ 103.31 ms │             105.18 ms │     no change │
│ QQuery 10    │  61.69 ms │              60.46 ms │     no change │
│ QQuery 11    │  17.21 ms │              16.20 ms │ +1.06x faster │
│ QQuery 12    │  51.71 ms │              51.02 ms │     no change │
│ QQuery 13    │  46.99 ms │              48.51 ms │     no change │
│ QQuery 14    │  13.95 ms │              13.82 ms │     no change │
│ QQuery 15    │  24.39 ms │              24.68 ms │     no change │
│ QQuery 16    │  24.45 ms │              24.85 ms │     no change │
│ QQuery 17    │ 153.72 ms │             149.28 ms │     no change │
│ QQuery 18    │ 329.41 ms │             329.57 ms │     no change │
│ QQuery 19    │  37.16 ms │              36.93 ms │     no change │
│ QQuery 20    │  50.78 ms │              49.59 ms │     no change │
│ QQuery 21    │ 348.92 ms │             333.10 ms │     no change │
│ QQuery 22    │  20.02 ms │              19.94 ms │     no change │
└──────────────┴───────────┴───────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                    │ 1967.28ms │
│ Total Time (refactor-agg-distinct)   │ 1874.61ms │
│ Average Time (HEAD)                  │   89.42ms │
│ Average Time (refactor-agg-distinct) │   85.21ms │
│ Queries Faster                       │         4 │
│ Queries Slower                       │         0 │
│ Queries with No Change               │        18 │
│ Queries with Failure                 │         0 │
└──────────────────────────────────────┴───────────┘

@Jefffrey
Contributor Author

The clickbench_partitioned QQuery 0 (I believe it's this query?), which shows as 1.18x slower, doesn't use DISTINCT, so I don't think it's an actual slowdown.

@Jefffrey Jefffrey added this pull request to the merge queue Nov 13, 2025
Merged via the queue into apache:main with commit e42a0b6 Nov 13, 2025
28 checks passed
@Jefffrey Jefffrey deleted the refactor-agg-distinct branch November 13, 2025 10:46